Teams

The Old Bailey

In this project, we propose to use Cilibrasi & Vitányi's method of clustering by compression (2005) as a domain-general way of finding related records across boundaries imposed by human categorizers. For our archive, we will use the Old Bailey Proceedings Online, a massive digital collection of trials held in Britain's foremost criminal court between 1674 and 1913. These records have been marked up with XML and hand-categorized according to a system of 56 specific types of crime in 9 general categories. While these categories are useful for some kinds of historical research, they tend to obscure similarities across different kinds of crime. We plan to use high-performance computing (HPC) to compute Cilibrasi and Vitányi's normalized compression distance for every pair of records in the Old Bailey archive. We will implement the functionality within the Voyeur Tools suite, an online text analysis environment for large-scale corpora. (We are currently in the process of migrating Voyeur to SHARCNET.) The Old Bailey archive will provide a useful test case for massive clustering operations, but the tool we will prototype is likely to be broadly useful to text researchers for clustering across categorization boundaries in large-scale corpora.
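
As a rough illustration of the distance measure named above, the following sketch computes the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length of its argument, for every pair in a small set of records. It assumes zlib as the compressor, and the function names, record identifiers, and sample texts are hypothetical; it is not the Voyeur Tools implementation.

```python
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def pairwise_ncd(records: dict) -> dict:
    """Map each unordered pair of record ids to its NCD.

    For n records this performs n * (n - 1) / 2 comparisons.
    """
    return {
        (a, b): ncd(records[a], records[b])
        for a, b in combinations(sorted(records), 2)
    }


if __name__ == "__main__":
    # Hypothetical record ids and texts, for illustration only.
    records = {
        "rec1": b"The prisoner was indicted for stealing a silver watch.",
        "rec2": b"The prisoner was indicted for stealing a gold watch.",
        "rec3": b"He was charged with uttering a forged bank note.",
    }
    for pair, distance in pairwise_ncd(records).items():
        print(pair, round(distance, 3))
```

Because the number of comparisons grows quadratically with the number of records and each pair can be computed independently, the work can in principle be divided across many processors, which is one reason an HPC setting suits this computation.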

Stéfan Sinclair, Associate Professor of Multimedia at McMaster University, is involved in the design and development of analysis and visualization tools for the digital humanities.

Cyril Briquet is a postdoctoral fellow in digital humanities and high-performance computing at McMaster University. He is currently working to scale out the analytics backend of Voyeur Tools.